feat(q4_0): FP32→Q4_0 quantizer (loader-agnostic production) #651

Merged
michalharakal merged 1 commit into develop from feature/q4_0-quantizer
May 30, 2026
@michalharakal

Contributor

Phase B — the produce side of Q4_0. Q4_0 was decode-only (GGUF arrives pre-quantized); this adds Q4_0Quantizer in commonMain so any source of dense FP32 weights — a SafeTensors/JSON loader, an in-memory tensor, an offline tool — can emit canonical ggml Q4_0 blocks without going through GGUF. This is the loader-agnostic primitive that "Q4_0 from any loader" actually requires.

What

  • Q4_0Quantizer.quantizeToBytes(FloatArray) / .quantize(FloatArray, Shape): Q4_0BlockTensorData.
  • Matches ggml quantize_row_q4_0: per 32-element block, d = max/-8 (where max is the signed max-magnitude element), code = clamp(round(x/d + 8), 0, 15), canonical split packing, FP16 round-to-nearest scale.
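
The per-block rule above can be sketched as follows. This is a minimal illustration, not the code in this PR: `Q4Block` and `quantizeBlockQ4_0` are hypothetical names, and the scale is kept as a `Float` for brevity, whereas the real block stores it as FP16.

```kotlin
import kotlin.math.abs

// Hypothetical holder for one quantized block (not the PR's Q4_0BlockTensorData).
// Assumption: scale kept as Float; the on-disk format stores it as FP16.
class Q4Block(val d: Float, val qs: ByteArray)

// Quantizes 32 consecutive floats starting at `offset` into one block.
fun quantizeBlockQ4_0(x: FloatArray, offset: Int = 0): Q4Block {
    require(x.size - offset >= 32) { "a Q4_0 block needs 32 elements" }

    // Find the element with the largest magnitude, keeping its sign.
    var amax = 0f
    var max = 0f
    for (j in 0 until 32) {
        val v = x[offset + j]
        if (abs(v) > amax) { amax = abs(v); max = v }
    }

    val d = max / -8f                      // scale: the max element maps to code 0
    val id = if (d != 0f) 1f / d else 0f   // all-zero block: every code becomes 8

    // Split packing: element j goes to the low nibble of byte j,
    // element j + 16 goes to the high nibble of the same byte.
    val qs = ByteArray(16)
    for (j in 0 until 16) {
        // x * id + 8 lies in [0, 16]; adding 0.5 and truncating rounds to nearest.
        val lo = minOf(15, (x[offset + j] * id + 8.5f).toInt())
        val hi = minOf(15, (x[offset + 16 + j] * id + 8.5f).toInt())
        qs[j] = (lo or (hi shl 4)).toByte()
    }
    return Q4Block(d, qs)
}
```

Because the max-magnitude element always lands on code 0, it decodes as (0 - 8) * d = max, which is what makes exact max-element recovery possible before the scale is rounded to FP16.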

Tests

  • Q4_0QuantizerTest — round-trip within 4-bit error, max-element recovery, zero-block, validation.
  • Q4_0QuantizeRoundTripMatmulTest — quantized weights run through ctx.ops.matmul and track the dense FP32 result, proving the quantizer output is consumable by the scalar/Panama/native kernels (Phase B ↔ Phase A).
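
For context on what the round-trip tests compare against, the decode side of a block can be sketched like this. It is an illustration only: `dequantizeBlockQ4_0` is a hypothetical helper (the project's own path goes through `Q4_0TensorData.toFloatArray`), and the scale is again passed as a `Float` rather than FP16.

```kotlin
// Hypothetical helper: decodes one Q4_0 block (16 packed bytes) into 32 floats.
// Assumption: scale passed as Float; the real format stores it as FP16.
fun dequantizeBlockQ4_0(d: Float, qs: ByteArray): FloatArray {
    require(qs.size == 16) { "a Q4_0 block packs 32 codes into 16 bytes" }
    val out = FloatArray(32)
    for (j in 0 until 16) {
        val b = qs[j].toInt() and 0xFF         // treat the byte as unsigned
        out[j] = ((b and 0x0F) - 8) * d        // low nibble  -> element j
        out[j + 16] = ((b ushr 4) - 8) * d     // high nibble -> element j + 16
    }
    return out
}

fun main() {
    // Every byte is 0x80: low nibble 0 (decodes to -8 * d), high nibble 8 (decodes to 0).
    val qs = ByteArray(16) { 0x80.toByte() }
    val x = dequantizeBlockQ4_0(0.25f, qs)
    println("first half: ${x[0]}, second half: ${x[16]}")
}
```

Under this scheme each element is off by at most |d|/2 after a round trip, except a value near -max, whose code is clamped from 16 down to 15 and can be off by up to |d|. That is the sense of "within 4-bit error" here, setting aside the extra FP16 rounding of the scale.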

Deliberately deferred

Automatic on-load quantization via a loader policy is not wired here. DTypePolicy targets logical DType, not TensorEncoding — so requesting "Q4_0" needs a new encoding-policy type, an RFC-level API decision (parallel to #615) the maintainer should own. This PR ships the reusable primitive every such path would call; the policy hook is a clean follow-up.

Targeting 0.27.0. Stack #647#650 already merged to develop; this branches off develop. Next: PR5 (docs).

🤖 Generated with Claude Code

Adds Q4_0Quantizer in commonMain — the produce side Q4_0 was missing
(it was decode-only, since GGUF arrives pre-quantized). Now any source
of dense FP32 weights — a SafeTensors/JSON loader, an in-memory tensor,
an offline tool — can emit canonical ggml Q4_0 blocks without GGUF.

Algorithm matches ggml quantize_row_q4_0: per 32-element block, scale
d = max/-8 (max = signed max-magnitude element), code = clamp(round(
x/d + 8), 0, 15), packed in the canonical split layout; scale stored as
round-to-nearest FP16.

Tests:
- Q4_0QuantizerTest — round-trips through Q4_0TensorData.toFloatArray
  within 4-bit error, recovers the max element, zero stays zero.
- Q4_0QuantizeRoundTripMatmulTest — quantized weights run through the
  matmul dispatch and track the dense FP32 result, proving the
  quantizer output is consumable by the (scalar/Panama/native) kernels.

Note: automatic on-load quantization via a loader policy is deliberately
NOT wired here. DTypePolicy targets logical DType, not TensorEncoding,
so requesting "Q4_0" needs a new encoding-policy type — an RFC-level API
decision (parallel to #615) the maintainer should own. This PR ships the
reusable primitive every such path would call.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

📖 Documentation Preview

The documentation has been built successfully for this PR.

Generated Files:

  • Operator documentation: docs/modules/operators/_generated_/
  • JSON schema output: operators.json

Artifacts:

  • Download the documentation-preview-651 artifact to view the complete documentation locally.

This comment will be updated automatically when the PR is updated.

@michalharakal michalharakal merged commit 8fb4168 into develop May 30, 2026
10 checks passed
@michalharakal michalharakal deleted the feature/q4_0-quantizer branch May 30, 2026 18:05